
General workflow to collate, format, enrich and check data
download-and-clean-gbif.Rmd
Introduction
This vignette demonstrates how to use the functions in the danubeoccurR package to download and clean fish species occurrence records from the Global Biodiversity Information Facility (GBIF) for the Danube River Basin. It covers the steps to filter, download, and clean the data to prepare it for further analysis.
Before starting, ensure that you have:
- An active internet connection to access GBIF data.
- A GBIF account (username, password, and email address), which is required to request downloads.
Additionally, ensure that you follow the installation guidelines of the hydrographr package for your operating system.
Step 1: Install the Package from GitHub
To get started, you need to install the danubeoccurR, hydrographr and specleanr packages from GitHub. You can do this using the devtools package:
# Install the devtools package if you haven't already
install.packages("devtools")
# Install the danubeoccurR package from GitHub
devtools::install_github("ytorres-cambas/danubeoccurR")
devtools::install_github("glowabio/hydrographr")
devtools::install_github("AnthonyBasooma/specleanr")
library(danubeoccurR)
library(hydrographr)
library(specleanr)
library(dplyr)
library(sf)
Step 2: Downloading GBIF Data
After installing the package, you can download fish species
occurrences using the download_gbif_records() function.
This will fetch records that fall within the geographic boundary of the
Danube River Basin.
# Define species to download. We will use only two species from a checklist
# of fish species from the Danube River Basin. The complete checklist is a
# dataset included with the danubeoccurR package.
species_list <- species_checklist[1:2]
print(species_list)
#> [1] "Abramis brama" "Acipenser baerii"
# Define a geographic bounding box to use for querying GBIF occurrences within
# the specified area. The bounding box should be a simple feature (sf) object,
# typically imported into R with sf::read_sf().
# In this example, we use an sf object that is included in the danubeoccurR
# package, so there is no need to import it manually.
bbox_danube_basin
#> Simple feature collection with 1 feature and 1 field
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: 8 ymin: 42 xmax: 30 ymax: 51
#> Geodetic CRS: WGS 84
#> gid geom
#> 1 1 POLYGON ((8 51, 30 51, 30 4...
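GBIF expects the polygon as a WKT string with counterclockwise vertex order. As an aside, the same string can be built and the winding order checked in plain base R; the following is a standalone sketch, independent of danubeoccurR, using the shoelace formula:

```r
# Corner coordinates of the Danube bounding box (from the sf object above)
xmin <- 8; ymin <- 42; xmax <- 30; ymax <- 51

# Build a counterclockwise ring: start at the top-left corner and go down first
ring <- matrix(c(xmin, ymax,
                 xmin, ymin,
                 xmax, ymin,
                 xmax, ymax,
                 xmin, ymax), ncol = 2, byrow = TRUE)

# Shoelace formula: a positive signed area means counterclockwise order
signed_area <- function(m) {
  x <- m[, 1]; y <- m[, 2]
  n <- nrow(m)
  sum(x[-n] * y[-1] - x[-1] * y[-n]) / 2
}
stopifnot(signed_area(ring) > 0)

# Assemble the WKT string
wkt_manual <- sprintf("POLYGON ((%s))",
                      paste(apply(ring, 1, paste, collapse = " "), collapse = ", "))
wkt_manual
#> [1] "POLYGON ((8 51, 8 42, 30 42, 30 51, 8 51))"
```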
# Convert the polygon to WKT format using the `get_wkt` function.
# This format is required for GBIF. The function takes an sf polygon object,
# ensures that the polygon has a counterclockwise winding order,
# and returns the WKT (Well-Known Text) representation of the polygon.
wkt <- get_wkt(bbox_danube_basin)
print(wkt)
#> [1] "POLYGON ((8 51, 8 42, 30 42, 30 51, 8 51))"
This function retrieves fish species occurrences from GBIF based on
the species list and the polygon that represents the Danube River Basin.
We set import_to_r = TRUE to import the occurrences into R.
The function returns a list containing a data frame of occurrences and
the information required to cite the download.
# Get the path to the temporary directory to save occurrences
temp_path <- tempdir()
# Download occurrences from GBIF
gbif_data <- download_gbif_records(species_names = species_list,
wkt = wkt,
output_occur_path = temp_path,
gbif_user = "your_username",
gbif_pwd = "your_password",
gbif_email="your_email@example.com",
import_to_r = TRUE)
Step 3: Cleaning GBIF Data
Once the occurrences are downloaded, we will clean the data using
clean_gbif(). This function removes duplicates, invalid
coordinates, and other quality issues. This ensures that the data is
ready for analysis, without missing or erroneous records.
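To make the idea concrete, here is a small base-R sketch of two of the checks that this kind of cleaning performs: removing duplicate records and dropping coordinates outside the valid decimal-degree range. The toy data and logic are illustrative only and are not the actual clean_gbif() implementation:

```r
# Toy records: a duplicate row and an out-of-range longitude
occ <- data.frame(
  species          = c("Abramis brama", "Abramis brama", "Acipenser baerii"),
  decimalLatitude  = c(48.2, 48.2, 45.1),
  decimalLongitude = c(16.4, 16.4, 191.0)
)

# Drop exact duplicates of species + coordinates
occ <- occ[!duplicated(occ[, c("species", "decimalLatitude", "decimalLongitude")]), ]

# Keep only coordinates that are valid decimal degrees
valid <- occ$decimalLatitude >= -90 & occ$decimalLatitude <= 90 &
         occ$decimalLongitude >= -180 & occ$decimalLongitude <= 180
occ <- occ[valid, ]

nrow(occ)
#> [1] 1
```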
# Clean the downloaded GBIF records
gbif_cleaned <- clean_gbif(gbif_data$raw_download,
coordinateUncertaintyInMeters = 1000,
coordinatePrecision = 0.01,
buffer_centroid_zoo_herbaria = 1000,
buffer_centroid_capitals = 1000,
buffer_centroid_countries = 1000)
#> Testing country centroids: Removed 4 records.
#> Testing country capitals: Removed 1 records.
#> Testing biodiversity institutions: Removed 27 records.
#> Testing missing year or species: Removed 271 records.
#> Retained 2149 out of 2453 records after cleaning.
As we can see, 2149 out of 2453 records were retained after the cleaning procedure.
# Number of records before cleaning
nrow(gbif_data$raw_download)
#> [1] 2453
# Number of records after cleaning
nrow(gbif_cleaned)
#> [1] 2149
# Preview the cleaned data
head(gbif_cleaned)
#> # A tibble: 6 × 50
#> gbifID datasetKey occurrenceID kingdom phylum class order family genus
#> <int64> <chr> <chr> <chr> <chr> <lgl> <chr> <chr> <chr>
#> 1 4.55e-315 6ac3f774-d9fb-… "" Animal… Chord… NA Cypr… Cypri… Abra…
#> 2 4.55e-315 6ac3f774-d9fb-… "" Animal… Chord… NA Cypr… Cypri… Abra…
#> 3 4.55e-315 6ac3f774-d9fb-… "" Animal… Chord… NA Cypr… Cypri… Abra…
#> 4 4.55e-315 6ac3f774-d9fb-… "" Animal… Chord… NA Cypr… Cypri… Abra…
#> 5 4.55e-315 6ac3f774-d9fb-… "" Animal… Chord… NA Cypr… Cypri… Abra…
#> 6 4.55e-315 6ac3f774-d9fb-… "" Animal… Chord… NA Cypr… Cypri… Abra…
#> # ℹ 41 more variables: species <chr>, infraspecificEpithet <chr>,
#> # taxonRank <chr>, scientificName <chr>, verbatimScientificName <chr>,
#> # verbatimScientificNameAuthorship <chr>, countryCode <chr>, locality <chr>,
#> # stateProvince <chr>, occurrenceStatus <chr>, individualCount <int>,
#> # publishingOrgKey <chr>, decimalLatitude <dbl>, decimalLongitude <dbl>,
#> # coordinateUncertaintyInMeters <dbl>, coordinatePrecision <dbl>,
#> #   elevation <dbl>, elevationAccuracy <dbl>, depth <lgl>, …
Step 4: Visualizing the Occurrences
We can now visualize the occurrences on a map using
visualize_points().
visualize_points(points_df = gbif_cleaned,
layer = danube_basin)
Step 5: Spatial Filtering
As shown, some occurrence records fall outside the Danube River
Basin. This happens because we used a bounding box as the spatial
reference when querying GBIF. To limit the data strictly to the Danube
River Basin, we can apply spatial filtering using the
get_spatial_subset() function. This function filters the
points based on a layer representing the Danube River Basin, which is
included as one of the datasets in the danubeoccurR
package.
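Conceptually, this spatial subsetting is a point-in-polygon test. Below is a minimal base-R illustration using the classic ray-casting algorithm on a toy polygon; it is a simplified stand-in, while get_spatial_subset() works with the real basin geometry:

```r
# Ray casting: count how often a horizontal ray from the point crosses the
# polygon edges; an odd crossing count means the point lies inside
point_in_polygon <- function(px, py, poly_x, poly_y) {
  n <- length(poly_x)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    crosses <- (poly_y[i] > py) != (poly_y[j] > py)
    if (crosses) {
      x_int <- poly_x[i] + (py - poly_y[i]) / (poly_y[j] - poly_y[i]) *
        (poly_x[j] - poly_x[i])
      if (px < x_int) inside <- !inside
    }
    j <- i
  }
  inside
}

# Toy triangular "basin" and two test points
poly_x <- c(10, 25, 17)
poly_y <- c(44, 44, 50)
point_in_polygon(16, 45, poly_x, poly_y)   # inside the triangle
#> [1] TRUE
point_in_polygon(9, 49, poly_x, poly_y)    # outside
#> [1] FALSE
```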
# Spatial filtering of occurrence records
gbif_filtered <- get_spatial_subset(danube_basin,
gbif_cleaned,
"decimalLatitude",
"decimalLongitude",
verbose = TRUE)
Now, the map displays only the occurrence records that are strictly located within the Danube River Basin.
visualize_points(points_df = gbif_filtered,
layer = danube_basin)
Step 6: Checking species names
To clean the species names and ensure consistency with FishBase, we
will use the check_names() function from the package
specleanr. This function matches species names to FishBase
and, by default, returns the accepted name if a synonym is provided. The
function also checks for spelling errors and suggests the closest match
in FishBase based on the similarity threshold defined by the pct
parameter. We will use the default pct value of 90, which requires a
match of 90% or higher.
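To illustrate what a percentage threshold like pct = 90 means, here is a base-R sketch that scores name similarity with the Levenshtein edit distance (via adist()); the exact scoring used by specleanr may differ:

```r
# Similarity between two names as a percentage of the longer string
# (one plausible definition; not necessarily the one used by specleanr)
name_similarity <- function(a, b) {
  100 * (1 - adist(a, b)[1, 1] / max(nchar(a), nchar(b)))
}

name_similarity("Abramis brama", "Abramis brama")          # exact match
#> [1] 100
# A misspelling with one extra letter still scores above the 90% threshold
round(name_similarity("Abramis bramma", "Abramis brama"))
#> [1] 93
```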
# Check species names for inconsistencies
gbif_name_checked <- check_names(data = gbif_filtered,
colsp = "species",
verbose = FALSE,
pct = 90,
merge = FALSE,
sn = FALSE)
# Preview the species names
print(gbif_name_checked)
As you can see, the species names in the species column,
which correspond to the species names in our dataset, match the names in
the speciescheck column from FishBase, indicating that no
changes are required.
Step 7: Add additional taxonomic information
If taxonomic information for Genus, Family, or Order is missing from some occurrences, it can be retrieved from FishBase and added. While the example dataset does not have missing taxonomic information, its classification differs from that used in FishBase. Therefore, we will update the taxonomic classification in the example dataset based on the information retrieved from FishBase. We will use the species name list obtained in the previous step after verifying species names against FishBase.
# Return genus, family and order of the species in the input vector.
tax_info <- get_taxonomy_info(species = gbif_name_checked$speciescheck, sources = "fishbase")
# Visualize result
print(tax_info)
# Visualize taxonomic classification used in the example dataset.
gbif_filtered %>%
select(species, genus, family, order) %>%
distinct(species, .keep_all = TRUE) %>%
print()
The result is a data frame containing columns for Species, Genus, Family, and Order. In this case, the species Abramis brama is classified under the family Leuciscidae; however, it was classified under Cyprinidae in the example dataset downloaded from GBIF. We will now proceed to update the family classification of Abramis brama.
# Update taxonomic information
gbif_filtered[gbif_filtered$species == "Abramis brama", "family"] <- "Leuciscidae"
Step 8: Add administrative information
Administrative information, such as country name and lower
administrative levels (e.g., administrative levels 1 and 2 from the
Database of Global Administrative Areas), can be added by performing a
spatial join between points and GADM Administrative Boundaries using the
spatial_join_gadm() function. If a vector layer is not
supplied but an ISO3 country code is provided, the function attempts to
download GPKG files from GADM and use them for the spatial join. If the
download fails due to a slow connection, the timeout
argument value can be increased to allow more time for the process. As a
final step, the columns are renamed to align with Darwin Core standards. In this case,
GADM administrative levels 1 and 2 correspond to the Darwin Core terms
stateProvince and county, respectively.
Therefore, we will rename the columns accordingly.
# Perform a spatial join of points with GADM Administrative Boundaries
admin_4_gbif <- spatial_join_gadm(
coords_df = gbif_filtered,
country_codes = c("ALB", "AUT", "BIH", "BGR", "CHE", "CZE", "DEU", "HUN","ITA", "HRV",
"MDA", "MKD","MNE", "POL", "ROU", "SRB", "SVK", "SVN", "UKR", "XKO"),
lat_col = "decimalLatitude",
lon_col = "decimalLongitude",
timeout = 1000)
# Rename columns
admin_4_gbif <- admin_4_gbif %>%
select(-stateProvince) %>% # delete original column
rename(country = COUNTRY,
county = NAME_2,
stateProvince = NAME_1)
Step 9: Extracting sub-catchment IDs
Next, we will extract the IDs of the sub-catchments where each
occurrence is located and add a new column with these IDs to the
dataset. To achieve this, we will use the extract_ids()
function from the
hydrographr
package. The sub-catchment layer was generated from the
Hydrography90m
dataset by downloading the tiles corresponding to the Danube River
Basin and cropping them using functions from the
hydrographr
package.
Let's first download, crop, and merge the raster tiles containing the sub-catchments.
# Folder to save final Hydrography90m layers
# Create output folder if it doesn't exist
output_hydro90m <- paste0(getwd(), "/hydrography90m")
if(!dir.exists(output_hydro90m)) dir.create(output_hydro90m)
# Folder to save temporary files
temp <- paste0(getwd(), "/temp")
if(!dir.exists(temp)) dir.create(temp)
# Define tile ID
id <- c("h18v02", "h18v04", "h20v02", "h20v04")
# Define raster variable
r_var <- c("sub_catchment")
# Download raster tiles
download_tiles(variable = r_var,
file_format = "tif",
tile_id = id,
download_dir = temp)
# Get the full paths of the raster tiles
raster_dir <- list.files(paste0(temp, "/r.watershed"), pattern = "\\.tif$",
full.names = TRUE, recursive = TRUE)
# Crop raster tiles to the extent of a sf object
for(itile in raster_dir) {
crop_to_extent(raster_layer = itile,
bounding_box = bbox_danube_basin,
out_dir = temp,
file_name = paste0(sub("\\.tif$", "", basename(itile)),
"_tmp.tif"))
}
# Merge filtered raster tiles and save result to a directory
merge_tiles(tile_dir = temp,
tile_names = list.files(temp,
full.names = FALSE,
pattern = "_tmp.tif"),
out_dir = output_hydro90m,
file_name = "subcatch_danube.tif",
name = "ID",
read = FALSE,
compression = "high",
quiet = FALSE)
Now, we are ready to extract the sub-catchment IDs.
# Create unique IDs for the rows in admin_4_gbif. This is required by the
# function extract_ids() and the column will be removed later.
admin_4_gbif$id_col <- seq_len(nrow(admin_4_gbif))
# Extract sub-catchment IDs
ids_df <- extract_ids(
data = admin_4_gbif,
lon = "decimalLongitude",
lat = "decimalLatitude",
id = "id_col",
basin_layer = NULL,
subc_layer = paste0(getwd(), "/hydrography90m/subcatch_danube.tif"),
quiet = FALSE
)
# Check all occurrences have sub-catchment IDs
if(length(which(is.na(ids_df$subcatchment_id))) == 0) {
print("All occurrences have sub-catchment IDs.")
} else {
print(paste0(length(which(is.na(ids_df$subcatchment_id))),
" occurrences are missing sub-catchment IDs."))
}
# New column with sub-catchment IDs
gbif_subcID <- admin_4_gbif %>%
mutate(subcatchmentID = ids_df$subcatchment_id) %>%
select(-id_col) # delete column with row ids
Step 10: Add occurrence IDs
To ensure each record has a unique and persistent identifier, we
will assign Universally Unique Identifiers (UUIDs) using the
generate_global_identifier() function. This aligns with
Darwin Core standards and facilitates data integration across
biodiversity databases.
First, we identify records that lack an
occurrenceID. For these records, we generate and assign
a UUID using the generate_global_identifier() function.
This step ensures that all occurrences have a globally unique
identifier, improving traceability and interoperability with other
datasets.
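For illustration, random version-4-style UUIDs can be generated in base R as follows; this is a sketch of the general idea, not necessarily how generate_global_identifier() is implemented:

```r
# Random version-4-style UUIDs in base R (illustrative sketch only)
make_uuid <- function(n = 1) {
  vapply(seq_len(n), function(i) {
    hex <- sample(c(0:9, letters[1:6]), 32, replace = TRUE)
    hex[13] <- "4"                               # version nibble
    hex[17] <- sample(c("8", "9", "a", "b"), 1)  # variant nibble
    paste0(paste(hex[1:8],   collapse = ""), "-",
           paste(hex[9:12],  collapse = ""), "-",
           paste(hex[13:16], collapse = ""), "-",
           paste(hex[17:20], collapse = ""), "-",
           paste(hex[21:32], collapse = ""))
  }, character(1))
}

make_uuid(2)
```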
# Identify records that lack an occurrenceID
indx <- which(gbif_subcID$occurrenceID=="")
# Generate occurrence IDs
ids <- generate_global_identifier(n = length(indx))
# Add ids
gbif_subcID[indx, "occurrenceID"] <- ids
Step 11: Check data format and completeness
To ensure data consistency and compliance with Darwin Core standards, we will perform a series of format checks.
Column names are verified with the check_column_name() function to confirm they conform to Darwin Core terms or to additional custom names. Year values are checked for missing entries and validated to be numeric and within an acceptable range using the check_year_column() function. The function returns a list with three elements:
- missing_values: indices of missing values in the year column.
- invalid_years: indices of values that fall outside the valid year range.
- updated_df: a data frame with the updated year values and a flag marking problematic entries.
Coordinate validation is performed with the check_coordinates() function to ensure latitude and longitude values follow the correct decimal degree format (WGS84).
These checks help maintain data quality and facilitate seamless integration with biodiversity databases. Finally, to ensure data completeness, we verify that each record includes at least a species name, date, and coordinates.
# Check column name
check_column_name(df_input = gbif_subcID,
standard_terms = c(dwc_names, "waterBodyType", "gbifID", "subcatchmentID"),
verbose = TRUE)
# Check for missing values and ensure that year values are numeric and fall
# within a valid range. The result is a list with the elements
# "missing_values", "invalid_years" and "updated_df".
gbif_year_checked <- check_year_column(df = gbif_subcID,
col_year = "year",
year_range = c(1800, 2024))
print(gbif_year_checked$missing_values)
print(gbif_year_checked$invalid_years)
# Check coordinates format and ensure they are numeric data type
gbif_coord_checked <- check_coordinates(df = gbif_year_checked$updated_df,
lat_col = "decimalLatitude",
lon_col = "decimalLongitude",
convert_to_numeric = TRUE,
verbose = TRUE)
# Check data completeness: every record should include at least a species
# name, a date (year) and coordinates
# (assuming check_coordinates() returns the checked data frame)
complete_records <- !is.na(gbif_coord_checked$species) &
  !is.na(gbif_coord_checked$year) &
  !is.na(gbif_coord_checked$decimalLatitude) &
  !is.na(gbif_coord_checked$decimalLongitude)
summary(complete_records)
Conclusion
This vignette shows how to download and clean fish species occurrence data from GBIF for the Danube River Basin using the danubeoccurR package. The cleaned data can be further analyzed or used for conservation strategies. This pipeline can also be integrated into broader biodiversity assessment workflows.